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Abstract 

Background: Parrots belong to a group of behaviorally advanced vertebrates and have an advanced ability of 
vocal learning relative to other vocal-learning birds. They can imitate human speech, synchronize their body 
movements to a rhythmic beat, and understand complex concepts of referential meaning to sounds. However, 
little is known about the genetics of these traits. Elucidating the genetic bases would require whole genome 
sequencing and a robust assembly of a parrot genome. 

Findings: We present a genomic resource for the budgerigar, an Australian Parakeet {Melopsittacus undulatus) - the 
most widely studied parrot species in neuroscience and behavior. We present genomic sequence data that includes 
over 300x raw read coverage from multiple sequencing technologies and chromosome optical maps from a 
single male animal. The reads and optical maps were used to create three hybrid assemblies representing 
some of the largest genomic scaffolds to date for a bird; two of which were annotated based on similarities 
to reference sets of non-redundant human, zebra finch and chicken proteins, and budgerigar transcriptome 
sequence assemblies. The sequence reads for this project were in part generated and used for both the 
Assemblathon 2 competition and the first de novo assembly of a giga-scale vertebrate genome utilizing PacBio 
single-molecule sequencing. 

Conclusions: Across several quality metrics, these budgerigar assemblies are comparable to or better than the 
chicken and zebra finch genome assemblies built from traditional Sanger sequencing reads, and are sufficient to 
analyze regions that are difficult to sequence and assemble, including those not yet assembled in prior bird 
genomes, and promoter regions of genes differentially regulated in vocal learning brain regions. This work 
provides valuable data and material for genome technology development and for investigating the genomics of 
complex behavioral traits. 

Keywords: Melopsittacus undulatus, Budgerigar, Parakeet, Next-generation sequencing, Hybrid assemblies, 
Optical maps, Vocal learning 
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Data description 

Raw genome DNA sequence reads 

DNA samples were obtained from a blood sample taken 
from a single male Melopsittacus undulatus, who we aptly 
named Mr. B. For Illumina sequencing, reads were gener- 
ated at Duke University (16x), Illumina UK (54x), and BGI 
(219x) using Illumina's TruSeq [1] version2 or version3 
chemistries (Table 1 and GigaDB [2]). The version3 chem- 
istry reads through GC-rich regions, which are often found 
in promoters, more evenly than does version2 [3], The in- 
sert sizes for the BGI libraries ranged from 220 bp to 
40 Kbp, and the insert sizes for the Duke libraries ranged 
from 400-600 bp, in order to assist assemblies. Fragment 
sizes for the mate pair libraries, based on genome mapping, 
and the per base sequence quality distribution for the li- 
braries are shown in GigaDB [2] . The Duke University Illu- 
mina libraries were sequenced at two different cluster 
densities: 8x coverage reads at the normal 420 k clusters/ 
mm density and 8x coverage at a lower 350 k clusters/mm. 
The lower cluster density was used to increase the number 
of GC-rich regions sequenced. For PacBio sequencing, 6.76 
Gbp (~5.5x coverage) of PacBio RS reads [4] were gener- 
ated at Pacific Biosciences from two insert size libraries 
(7.5 K bp at 1.93x and 13 Kbp at 3.56x; PacBio reads 
error-corrected with Illumina can be downloaded from the 
supplementary webpage associated with [5]). With all reads 
combined, the total coverage exceeds 300 x (assuming a 
haploid genome size of 1.23 Gbp) (Table 1), perhaps mak- 
ing Mr. B one of the most sequenced individual vertebrate 
animals as of to date. The read length distributions of these 
different types of reads are shown in Figure 1. 

Fosmid Library 

To validate the assemblies in the Assemblathon 2 com- 
petition, a fosmid library was created from sheared gen- 
omic DNA (35-40 Kbp) of Mr. B [6]. Ten pools of 
clones were generated and sequenced using Illumina as 
described in [7]. Each pool of reads was individually as- 
sembled using Velvet [8]. The fosmid assemblies have 
been deposited at GigaDB [2]). 

Transcriptome Reads 

454 FLX transcriptome reads were generated from 
brain RNA isolated from two males, neither of whom 

Table 1 Summary of genomic reads 

Library sizes 

454 Shotgun, 3 kb, 8 kb, 20 kb mate pair 

Illumina 220, 230, 500, 400-600, 800, 2 kb, 

5 kb, 10 kb, 20 kb, 40 kb paired end 

Pacific Biosciences 7.5Kb, 13 kb 

Combined 



was Mr. B. An initial set of sequencing runs of both males 
was conducted at Washington University at St. Louis, pro- 
ducing 89.2 Mb of transcriptome sequence as reported in 
[9] (NCBI accession numbers SRR029329-30) and were 
assembled using Newbler [10] into 19,198 contigs. An 
additional 21x coverage (run label GK0K2XF01) was gen- 
erated at Duke University from one of the males. 

Assemblies 

We present three hybrid assemblies: 1) Budgerigar 
454-illumina hybrid v6.3 using the CABOG assem- 
bler; 2) Budgerigar PBcR hybrid using the CABOG as- 
sembler; and 3) Budgerigar illumina-454 hybrid using 
the SOAPdenovo2 assembler. The first two assemblies 
were annotated, after which, optical-map assisted mega- 
scaffolds were constructed based on them. As of yet, the 
SOAPdenovo2 assemblies have not been annotated or 
aligned to optical maps. The quality statistics of these as- 
semblies are in listed in Table 2, and brief descriptions of 
their construction and relative quality are provided in 
Additional file 1. 

Validating sequence assemblies with optical maps 

Optical Mapping is a single molecule system for the 
construction of ordered restriction maps of whole ge- 
nomes [11], and it has been used to guide and validate 
sequence assemblies [12]. An optical map for the bud- 
gerirgar genome was created, using a method described 
in Additional file 1. The optical map contigs ranged in 
size from 2 Mbp to 74 Mbp and spanned over 900 Mbp 
with a resolution of 13.94 Kbp (i.e., one non-redundant 
Swal every 13.94 Kbp). The contigs were then aligned to 
in silico restriction maps generated from Budgerigar_v6.3 
and PBcR assembly scaffolds in order to validate the scaf- 
folds. An approximate 859.21 Mb of the optical maps 
aligned to the Budgerigar_v6.3 assembly, in 146 scaffolds 
with 3 or more Swal restriction fragments (excluding ends 
and fragments less than 0.4 Kbp). Of these 146 scaffolds, 
43 appeared chimeric (i.e., aligned to two or more optical 
map contigs). For the PBcR assembly, 796.63 Mbp optical 
map contigs aligned, in 673 scaffolds. Of the 673 scaffolds, 
only 51 were chimeric. Thus, while the Budgerigar_v6.3 
assembly has a higher N50 scaffold metric and hence lon- 
ger scaffolds compared to the PBcR assembly, 30% the 



Total reads Total BP (Mb) Coverage (assuming 1.23 Gbp 

genome size) 

41,898,557 19,736 15.4x 

561,074,047 356,597 289x 

4,176,242 6,763 5.5x 

607,148,846 383,096 309.9X 
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Read Length (5bp Buckets) Log-Scaled 

Figure 1 The distribution of read lengths in 454, lllumina and 
PacBio budgerigar sequences. The reads are binned into 5 bp 
buckets based on their lengths, and the fraction of reads 
(normalized by the size of the largest bucket) falling into each 
bucket is shown. Thus, curves shifted towards the right indicate 
longer read lengths. The reads labeled "20 Kbp", "8 Kbp" and "3 Kbp", 
"FLX Titanium" and "FLX Titanium XL+" are 454 reads. The reads 
labeled "PacBio pre-release C2" are uncorrected PacBio reads. The 
lllumina read lengths appear as colored square boxes, since these 
read lengths are uniform. The "lllumina Duke" reads are of length 76, 
The "lllumina UK" reads are of length 101, and the "llllumina BGI" 
reads are of lengths 90 or 150. The longest reads come from 
PacBio sequencing, followed by 454 FLX + (i.e., FLX Titanium 
XL+) sequencing. 



v6.3 scaffolds are chimeric, whereas only 7.6% of the PBcR 
assembly are chimeric. 

Optical map assisted assemblies 

We took both Budgerigar_v6.3 and PBcR assemblies and 
filtered out alignments that did not extend to the end of 
either the genomic sequence scaffold or the optical map. 
The remaining high-quality alignments were then used 
to identify optical map alignments that bridged scaffolds, 
such that a single optical map aligned to the ends of at 
least two sequence scaffolds. We then iteratively ex- 
tended the megascaffolds beyond pairs of sequence scaf- 
folds, using three heuristics: (1) we limited the overhangs 
(i.e., the portion of the scaffold sequence that does not 
align to the optical map) to 2 Mbp total; (2) we bridged 
two scaffolds together only if the size of the gap separating 
them is less than 2 Mbp of Ns; and (3) if a sequence scaf- 
fold aligned to more than one optical map, we placed it 
into the largest one it aligns with. The above procedure 
slightly reduced the number of scaffolds from 25,212 to 



25,163 in the Budgerigar_v6.3 assembly, and from 54,668 
to 54,138 in the PBcR assembly. This relatively small 
change in number is expected as our procedure tended to 
join only sequence scaffolds that were already fairly large 
into even larger megascaffolds, since it is only possible to 
confidently align an optical map to a fairly large sequence 
scaffold bearing numerous Swal restriction sites. However, 
this analysis substantially improved the scaffold N50 sizes 
from 10.6 Mbp to 13.8 Mbp in the Budgerigar_v6.3, and 
1.7 Mbp to 7.3 Mbp in the PBcR assemblies, respectively 
(Table 2). Without limiting the length of the overhangs 
and gap sizes to 2 Mbp, the increase in N50 scaffold sizes 
in the Budgerigar_v6.3 is 17.1 Mbp (which we think could 
be an artifact). We speculate that some of the large gaps 
in the optical map correspond to centromeres or highly 
repetitive DNA that are difficult to assemble. 

Annotations 

The Budgerigar_v6.3 and PBcR assemblies were annotated 
at BGI for protein coding genes by first generating a refer- 
ence set of human, chicken and zebra finch proteins, and 
then aligning the reference set to the assemblies, and 
propagating annotations to 30% coverage of the reference 
at TBlastN, E = le . For the Bud gerigar_v6.3 assembly, 
the reference set comprised of human proteins from 
Ensembl 60 and a set of zebra finch and chicken proteins 
re-annotated based on these human proteins, using a cus- 
tom BGI pipeline reported on separately (Jarvis et al. in 
preparation; Zhang et al., in preparation). For the PBcR 
assembly, the reference set comprised of the Ensembl 
60 human, chicken and zebra finch proteins. The 
propagation of these reference sets to the budgerigar 
assemblies is described in more detail in Additional 
file 1. Further, in the PBcR assembly, UTRs were an- 
notated for 6,203 genes using the GK0K2XF01 tran- 
scriptome runs with a pipeline similar to the one 
described in [13]. The assembly annotations were then 
propagated to the corresponding sets of megascaffolds. 
No de novo gene annotations were performed. 

The annotated Budgerigar assemblies had fewer genes 
(15,470 and 16,204 genes in the Budgerigar_v6.3 and 
PBcR assemblies respectively) than the published Zebra 
Finch (18,618 genes) and Chicken genome assemblies 
(17,108 genes in the 2011 Galgal4 assembly [14]). We 
believe the lower number of annotated genes in budgeri- 
gar assemblies is due to the differences in annotation 
methods rather than assembly completeness, for two 
reasons: (1) These annotations were produced based on 
similarities to zebra finch, chicken and human proteins, 
and hence they cannot contain more genes than the 
source genome annotations; and (2) The independent 
GenScan annotation of the Budgerigar_v6.3 assembly at 
the UCSC Genome Browser contains more genes than 
in zebra finch and chicken, 24,095 in total. 



Table 2 Summary of assemblies 



Budgerigar v6.3 PBcR Megascaffolds from Megascaffolds from lllumina + 454 Zebra Chicken 

Budgerigar_v6.3 + Optical PBcR + Optical Map SOAPdenovo2 Finch [15] v4[13]* 
Map 



Chicken Peregrine Puerto Rican Macaw 
v3[16] Falcon [17] Parrot [21] 1.1 [20] 



Assembler 



Sequence 
method 



Coverage 

Genome 
size 

Total bases 
in scaffolds 

Number of 
scaffolds 

Avg. scaffold 
size 

NSO scaffold 
size 

Largest scaffold 
size 

Total gaps in 
scaffolds 

Number of 
Contigs 

Avg. contig 
size 

N50 contig size 
Largest contig 



Celera CABOG 
[25] 

454 FLX, FLX+, 
lllumina 



14x 

1.2Gbp 

1,117,358,947 

25,212 

44,319 

10,614387 

39,887,647 

51,150 

70,863 

15,334 

55,633 
465,633 



PBcR assembler 
[5] 

PacBio corrected 
with lllumina, 
454 FLx, FLx+ 

17x 

1 ,2Gbp 

1,219,132,003 



22,300 

I, 705,751 

II, 564,683 

26,444 

77,556 

15,344 

102,885 
849,044 



454 FLX, FLX+, 
lllumina, Optical 
Maps. 



UGbp 

1,118,758,630 

25,163 

44,460 

13,823,040 

61,483,320 

51,295* 

NA 

NA 

NA 
NA 



PacBio corrected 
with lllumina, 454 
FLx, FLx+, Optical 
Maps. 



1.2Gbp 

1,241,439,339 

54,138 

22,931 

7,280,340 

33,208,800 

27,118 

NA 

NA 

NA 
NA 



SOAPdenovo2 PCAP [27] NA 
[26] 



PCAP [27] SOAPdenovo Ray [30] CLC Genomics 



llumina, 454 
FLx+ 



137.59 lllumina, 
6.85 FLx+ 

UGbp 

1,169,860,945 

151,393 

7,727 

13,497,021 

66,566,439 

60810 

212,203 

4664 

51,034 
500,974 



Sanger Sanger, 454 



[28,29] 

Sanger v2.1 lllumina 



Workbench 



lllumina, 

454 FLx+ 



7.1 xs 107x 
1.05Gbp 1.2Gbp 



6x 19.1x 
UGbp UGbp 
1,224,525,252 1,046,932,099 1,047,124,295 1,174,046,505 1,164,566,833 997,000 



26.9X 
1.58Gbp 



26X 
1.2 Gbp 



37,e 



32,482 



15,932 



65,713 



23,776 



44,041 



21,224 



55,317 



148,255 



140,453 



7,855 Not available 



10,409,499 90,216,835 

56,620,707 195,276,750 

124,736 NA 

126,053 27,027 

9,714 38,736 



11,125,310 3,891,469 19,470 15,968 

51,053,708 18,327,016 206,462 177,843 



38,549 
424,635 



279,750 
NA 



NA 



85,191 



12,291 



45,280 
624,663 



77,368 Not available Not available 



98,540 

11,914 

28,599 
247,807 



259,423 



214,754 



4,304 Not available 



6,983 
75,003 



6,366 
87,225 



The Chicken v4 assembly consists of chromosomes and not scaffolds with explains the very high scaffold length statistics. 

*The increased number of gaps in megascaffolds reflects the fact that each megascaffold may be merger of many original scaffolds with gaps in between them. 
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Comparisons to other avian assemblies 

Our budgerigar genome assemblies were compared with 
the zebra finch, chicken, and falcon genomes [15-17]. 
The other assemblies from the Assemblathon 2 competi- 
tion are available from GigaDB [18]. The zebra finch and 
chicken had similar contig and scaffold N50 values 
(38.5 kb and 10.4 Mb for zebra finch, and 279.8 kb and 
90.2 Mb for chicken, respectively). In addition, since the 
Peregrine Falcon is the closest relative to parrots [19], 
we also compared the budgerigar genome assemblies to 
this bird. However, it was not possible to do an in depth 
comparison of these genomes to the recently sequenced 
Scarlet Macaw and Puerto Rican Parrot genomes [20,21], 
because both bird genomes had N50 scaffold sizes under 
20,000 and N50 contig sizes under 7,000. A summary of 
assemblies, including the Scarlet Macaw and Puerto Rican 
Parrot, are shown in Table 2. Apart from the standard 
genome assembly quality statistics, we assessed the quality 



of the budgerigar assemblies along two other dimensions: 
(1) the coverage of highly conserved avian exons, and (2) 
the number of gaps 10 Kbp upstream and downstream of 
each gene (gene territories), and conversely, the number 
gene territories assembled without gaps. Of 3,288 highly 
conserved exons (>86% coverage across >87% of their 
length) we identified between chicken and zebra finch, 
3,165 (96.25%) and 3,134 (95.31%) were covered with >86% 
identity across >87% of their length in the Budgerigar_v6.3 
and PBcR assemblies respectively, pointing to good coverage 
of coding regions in these assemblies. The budgerigar as- 
semblies had fewer gaps within the coding sequences and 
gene territories than all other avian genomes examined, ex- 
cept the newer unpublished Galgal4 chicken assembly that 
is similar to the budgerigar in that it is a hybrid that includes 
both short and long sequences (Sanger and 454 FLX+) 
(Figure 2). This suggests that our budgerigar assemblies 
have very well assembled genes and promoter regions. 



budgerigar 




Figure 2 Number of nucleotide gaps assess relative assembly incompleteness. A) Shows the total number of gaps in genes and the 
surrounding 10,000 base pair regions upstream and downstream (collectively called gene territories). B) Shows the number of such gene 
territories with gaps. In both the panels, different species assemblies are colored differently, with the budgerigar assemblies shown in dark blue. 
The budgerigar assemblies with the "-mega" suffix are optical map enhanced versions of the Budgerigar_v6.3 and PBcR assemblies. The 
budgerigar assemblies have the highest numbers of gapless gene territories (right panel) and the fewest number of gaps of all assemblies except 
the recent chicken v4 assembly, which used a similar technology (left panel). 
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Using the online CoGe tool [22-24], we assessed the 
structural similarities between the various budgerigar as- 
semblies and other avian assemblies [25-30], by computing 
the level of coding sequence synteny among assemblies. 
The highest numbers of genes in synteny were observed, 
as expected, between a budgerigar assembly and the optical 
map assisted version of the same assembly (Figure 3A). 
However, the number of genes in synteny between the 
Budgerigar_v6.3 and the PBcR assemblies was similar to 
the number of genes in synteny between budgerigar and 
falcon (Figure 3A, B). Further, the number of genes in syn- 
teny did not strictly reflect phylogenetic relationships, as 
the zebra finch and budgerigar, close relatives [19], had a 
lower level of synteny than budgerigar and chicken. In 
addition, a number of inversions were observed even in the 
syntenic dotplots between the original budgerigar assem- 
blies and their optical map-assisted assemblies (88 inver- 
sions between Budgerigar_v6.3 and Budgerigar_v6.3_mega; 
209 inversions between PBcR and PBcR mega, plots 
shown in GigaDB [2]). This suggests that synteny based on 
CoGE syntenic maps is affected by the quality of the as- 
semblies and the characteristics of the synteny algorithm. 
Thus, the number of genes in synteny computed using the 
available methods is only a rough measure of the actual 
structural similarity between the assemblies compared. 

In summary, this study shows that the budgerigar gen- 
omic resource we have generated has provided [5,6] (and is 
still expected to provide more) valuable data and material 
for genome technology development and for further inves- 
tigating complex behavioral traits at the genomics level. 



All procedures on live animals were approved by the 
Institutional Animal Care and Use Committee of Duke 
University. 

Availability and requirements 

The genomic sequence reads have been deposited in NCBI's 
sequence read archives (SRA) and the EBIs ENA archive, 
under the same project accession number ERP002324. The 
SOAPdenovo2 assembly has been submitted to GigaDB by 
the Assemblathon 2 team and is available at GigaDB [18]. 
Other supporting resources that have been deposited in 
GigaDB [2] are: 

• Duke University brain transcriptome reads. 

• Budgerigar_v6.3, PBcR assemblies (contigs and 
scaffolds) and optical map assisted megascaffolds 
based on these two assemblies (two contigs and four 
scaffolds in total). 

• The per base sequence quality distribution of the 
paired end and mate paired libraries. The estimated 
fragment length distribution of the mate paired 
libraries. Peptide and coding sequences (CDS) for 
the Budgerigar_v6.3 and PBcR assemblies. 

• Gene annotations and Repeat Masker annotations 
for the scaffolds. 

• Optical map alignments of Budgerigar_v6.3 and 
PBcR assemblies in Microsoft Excel and XML 
formats and software (Gnomspace.rar) to view the 
XML alignments. 

• The optical map dataset. 



Assembly groups 




" <*r ^ ^ 

,»* ,4? 




- J** ^ X 



A 



B 



Figure 3 The number of genes that are part of a syntenic block between different budgerigar assemblies (A) and between budgerigar 
and non-budgerigar assemblies (B). The numbers were calculated from CoGE syntenic dotplots (not shown), as the total number of genes 
represented in syntenic blocks. The y-axis limits have been cut off close to the minimum value in the plot to show a more detailed spread 
of values. 
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Additional file 



Additional file 1: Supplementary materials. 
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CABOG: Celera assembler with the best overlap graph; CoGE: Comparative 
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